Document Level Sentiment Analysis with Deep Learning Models#

Load Twitter Datasets#

Twemlab Goldstandard dataset
Size: 994
Polarity: (1 pos, 0 neu, -1 neg)
Sample of the data:
  Text Label sentiment_label
472 Woman and her dog fight off sex attacker in Kings Norton park anger/disgust -1
970 Birmingham are consulting on extending shared-use paths in Sheldon Park an important cycle link to Marston Green. none 0
145 birminghammail-new horror book set around the Lickey Hills. none 0
SemEval Goldstandard dataset
Size: 3713
Polarity: (1 pos, 0 neu, -1 neg)
Sample of the data:
  Text Polarity sentiment
2811 Monday nights are a bargain at the $28 prix fix - this includes a three course meal plus *three* glasses of wine paired with each course. positive 1
886 very good breads as well. positive 1
1270 On the other hand, if you are not fooled easily, you will find hundreds of restaurants that will give you service and ambiance that is on par with Alain Ducasse, and food that will outshine in presentaion, taste, choice, quality and quantity. negative -1
AIFER example dataset
Size: 11177
Polarity: none (unlabelled data)
Sample of the data:
  Date Text Language
6519 2021-07-20 16:31:14 @karlheinz_e Da liegt er falsch. Hier haben alte links-grüne-versiffte Soldaten mit grünen Gutmenschen u.a. auch Flüchtlinge unterstützt. AfD-Politiker habe ich hier nicht gesehen. de
7712 2021-07-22 22:02:41 @MDegen55 All Nathan Fillion Fans of the World. Good Night and happy Friday. ♥️ https://t.co/PtWuzODGwB en
11116 2021-07-21 18:20:17 @CWerdinger Hier im Katastrophengebiet spielt Corona keine Rolle mehr,auch kein Hass oder Hetze, hier zählt nur noch Hilfsbereitschaft ❤️❤️❤️Masken interessieren hier niemanden mehr, wer Maske tragen will kann das hier im Katastrophengebiet tun, wer nicht brauch auch keine zu tragen. 👍☺️ de

Transformers-based NLP Models#

Why Transformers?#

In recent years, the transformer model has revolutionized the field of NLP. This ‘new’ deep learning approach has been highly successful in a variety of NLP tasks, including sentiment analysis, and has been shown to outperform both traditional machine learning and other deep learning methods. Some of its key advantages are:

  • The encoder-decoder framework: Encoder generates a representation of the input (semantic, context, positional) and the decoder generates output. Common use case: sequence to sequence translation tasks.

  • Attention mechanisms: Deal with the information bottleneck of the traditional encoder-decoder architecture (where only the final encoder hidden state is passed to the decoder) by letting the decoder access the encoder’s hidden states at every step and prioritise whichever state is most relevant.

  • Transfer learning (i.e. fine-tuning a pre-trained language model)



A note on Attention#

In transformers, multi-head scaled-dot product attention is usually used. This attention mechanism allows the Transformer to capture global dependencies between different positions in the input sequence, and to weigh the importance of different parts of the input when making predictions.

In scaled dot-product attention, a dot product between each query and key vector is computed, scaled by the square root of the key dimension, and passed through a softmax; the resulting weights are used to form a weighted sum of the value vectors. The attention mechanism is repeated multiple times with different linear projections (hence “multi-head”) to capture different representations of the input.

Code implementation
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(
            self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs
     

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
        )
        self.output_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x
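The `scaled_dot_product_attention` helper called inside `AttentionHead` is not defined above; a minimal sketch of it, computing softmax(QKᵀ/√d_k)V as described earlier:

```python
import torch
import torch.nn.functional as F
from math import sqrt

def scaled_dot_product_attention(query, key, value):
    # dot products of each query with each key, scaled by sqrt(d_k)
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    # softmax over the key positions yields the attention weights
    weights = F.softmax(scores, dim=-1)
    # output: attention-weighted sum of the value vectors
    return torch.bmm(weights, value)
```

With inputs of shape `[batch, seq_len, head_dim]`, the output keeps the same shape, so the per-head outputs can be concatenated as in `MultiHeadAttention` above.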

Here’s a visual representation of the attention mechanism at work with a demo text “The hurricane trashed our entire garden”:

from IPython.display import display, HTML
display(HTML('https://raw.githubusercontent.com/Christina1281995/demo-repo/main/neuron_view.html'))

Types of Transformers#

A useful way of differentiating between the many transformer-based models that have sprung up in recent years is by which of the transformer’s blocks (encoder, decoder, or both) they use.

The encoder block’s main role is to “update” the input embeddings to produce representations that encode some contextual information (called the context vector). Many models make use only of the encoder block and add a linear layer as a classifier. BERT is perhaps the most prominent example of an encoder-based architecture (the name is literally “bidirectional encoder representations from transformers”).

The decoder block uses the context vector from the encoder to generate an output sequence. Like the encoder, the decoder also computes self-attention scores and processes the context vector through multiple feedforward layers. The decoder additionally includes an attention mechanism that allows it to focus on specific parts of the input sequence when generating the output. Decoder-based models are exceptionally good at predicting the next word in a sequence and are therefore often used for text generation tasks. Progress here has been fuelled by using larger datasets and scaling the language models to larger and larger sizes (GPT-3 has 175 billion parameters). The most famous examples of decoder-based models are the Generative Pretrained Transformer (GPT) models by OpenAI.



Two ways of using Pre-trained Transformer Language Models#

1. For Feature Extraction:

Extract features with the encoder and then separately train a classifier on the hidden states. This method essentially freezes the body’s weights during training.
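The “frozen body” idea can be illustrated in plain PyTorch; the two `Linear` layers below are toy stand-ins for the transformer body and the classification head:

```python
import torch.nn as nn

# toy stand-ins for the pre-trained encoder body and the new classifier head
body = nn.Linear(768, 768)
head = nn.Linear(768, 3)

# feature extraction: freeze the body so only the head is trained
for p in body.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in body.parameters())
print(trainable, frozen)  # only the head's parameters remain trainable
```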

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

# tokenize batch-wise; batch_size=None in map() below processes the whole dataset in one batch
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

def extract_hidden_states(batch):
    # Place model inputs on the GPU
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}
    
# map tokenizer to entire dataset 
semeval_encoded = semeval.map(tokenize, batched=True, batch_size=None)

# set correct format (tensors) because that's what the model expects as input
semeval_encoded.set_format("torch",columns=["input_ids", "attention_mask", "label"])

# extract hidden states
semeval_hidden = semeval_encoded.map(extract_hidden_states, batched=True)
# only have train data (without validation set in this case)
X_train = np.array(semeval_hidden["train"]["hidden_state"])
y_train = np.array(semeval_hidden["train"]["label"])
print(f"The shape of [dataset_size, hidden_dim]: {X_train.shape}")
The shape of [dataset_size, hidden_dim]: (3041, 768)
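A simple classifier can then be trained on these frozen features; a sketch using scikit-learn’s logistic regression (one common choice), with random arrays standing in for the extracted hidden states and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# random stand-ins for the extracted [CLS] hidden states and labels
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(-1, 2, size=200)   # labels in {-1, 0, 1}

# train a classifier head on top of the frozen features
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Training accuracy: {clf.score(X_train, y_train):.2f}")
```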

For visualisation purposes, the 768 dimensions can be reduced down to only 2 dimensions using UMAP. The features are first rescaled with MinMaxScaler so that each dimension lies between 0 and 1, which UMAP works best with.

from umap import UMAP
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label_name"] = y_train
df_emb.head()
          X         Y  label_name
0  0.336880  8.937547          -1
1  0.071688  4.982654           1
2  1.698690  3.108218           1
3  4.818382  4.709717           1
4  2.971198  6.404373           1

These 2D vectors can now be plotted:

fig, axes = plt.subplots(1, 3, figsize=(10,4))
axes = axes.flatten()
# colors for 2D space
cmaps = ["Reds", "Oranges", "Greens"]
# labels
labels = ['negative', 'neutral', 'positive']


for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    # label_name i-1 because the "label_name" column ranges from -1 to 1
    df_emb_sub = df_emb.query(f"label_name == {i-1}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])

plt.suptitle("A 2D Visual Representation: The Extracted Features from the SemEval Dataset by the DistilBERT Encoder")
plt.tight_layout()
plt.show()
     
_images/Deep_Learning_33_0.png

2. For Fine-Tuning:


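In fine-tuning, the body’s weights are not frozen: the whole model is updated end-to-end on the labelled data, usually with a small learning rate. A toy sketch of one training step in plain PyTorch (the `Sequential` model stands in for a transformer body plus classification head):

```python
import torch
import torch.nn as nn

# toy stand-in for a pre-trained body with a classification head on top
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 3))

# fine-tuning: every parameter receives gradients and is updated
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
x = torch.randn(8, 768)          # a batch of "embeddings"
y = torch.randint(0, 3, (8,))    # a batch of class labels
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```

In practice this is usually done through the Hugging Face `Trainer` API rather than a hand-written loop.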
RoBERTa#

RoBERTa (Robustly Optimized BERT Pretraining Approach) has the same architecture as BERT but improves on it in several ways:

  • RoBERTa was trained on 10x as much data as was used for BERT training (160GB, compared to 16GB for BERT)

  • Dynamic masking was used during training, rather than fixed masking in BERT

  • The next-sentence prediction objective was left out during training; it is arguably not essential, especially when considering tweets. Here is a view of the average tweet length in the Twemlab dataset:

_images/Deep_Learning_38_0.png
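Tweet lengths like those plotted above can be computed by counting whitespace tokens; a sketch using two sample tweets from the Twemlab data shown earlier:

```python
import pandas as pd

# two sample tweets from the Twemlab dataset
tweets = pd.Series([
    "Woman and her dog fight off sex attacker in Kings Norton park",
    "birminghammail-new horror book set around the Lickey Hills.",
])
token_counts = tweets.str.split().str.len()
print(token_counts.tolist())  # → [12, 8]
```

On the full dataset, `token_counts.hist()` would produce a length distribution like the figure above.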

The model used here is the cardiffnlp/twitter-roberta-base-sentiment-latest

Source: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest?text=Covid+cases+are+increasing+fast!

This is a roBERTa-base model trained on ~124M tweets from January 2018 to December 2021 (see here), and fine-tuned for sentiment analysis with the TweetEval benchmark. The original roBERTa-base model can be found here and the original reference paper is TweetEval. This model is suitable for English.

From the Authors

TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification (Barbieri et al., 2020)

According to the authors of the model, among all available language models RoBERTa is one of the top-performing systems on GLUE (Liu et al., 2019). It does not employ the Next Sentence Prediction (NSP) loss (Devlin et al., 2018), making the model more suitable for Twitter, where most tweets are composed of a single sentence.

Three different RoBERTa variants were used:

  • pre-trained RoBERTa-base (RoB-Bs),

  • the same model but re-trained on Twitter (RoB-RT) and

  • trained on Twitter from scratch (RoB-Tw)

Results indicate that RoB-RT performs best for sentiment analysis and that the RoBERTa model outperforms the comparison methods in all investigated NLP tasks:

tw_rob_base_sent_lat = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer_tw_rob_base_sent_lat = AutoTokenizer.from_pretrained(tw_rob_base_sent_lat)
config_tw_rob_base_sent_lat = AutoConfig.from_pretrained(tw_rob_base_sent_lat)
# PT
model_tw_rob_base_sent_lat = AutoModelForSequenceClassification.from_pretrained(tw_rob_base_sent_lat)
#model.save_pretrained(tw_rob_base_sent_lat)
# testing
text = "Well isn't this just terrible..."
text = preprocess(text)
encoded_input = tokenizer_tw_rob_base_sent_lat(text, return_tensors='pt')
output = model_tw_rob_base_sent_lat(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config_tw_rob_base_sent_lat.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
1) negative 0.9382
2) neutral 0.055
3) positive 0.0068
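The `preprocess` call used above (and the `softmax`, presumably from `scipy.special`) is not defined in this snippet; a sketch of such a helper, following the cardiffnlp model card’s preprocessing (user handles and links replaced by placeholder tokens):

```python
def preprocess(text):
    # replace user handles and links with placeholder tokens,
    # as described in the cardiffnlp model card
    tokens = []
    for t in text.split(" "):
        if t.startswith("@") and len(t) > 1:
            t = "@user"
        elif t.startswith("http"):
            t = "http"
        tokens.append(t)
    return " ".join(tokens)

print(preprocess("@user123 so wrong https://t.co/abc"))  # → "@user so wrong http"
```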
# apply in the form of a function so it can be called for usecase later on
def robertabase_apply(dataset):
    
    # create variable for labels (good to bad)
    labels= ['positive', 'neutral', 'negative']
    
    # lists to be filled
    cardiffroberta_sentiment_prediction = []
    cardiffroberta_sentiment_prediction_softmax = []
    cardiffroberta_sentiment_prediction_num = []
    
    # iterate over dataset
    for index, row in dataset.iterrows():
        text = row['text']
        text = preprocess(text)
        encoded_input = tokenizer_tw_rob_base_sent_lat(text, return_tensors='pt')
        output = model_tw_rob_base_sent_lat(**encoded_input)
        score = np.round(softmax(output[0][0].detach().numpy()), 4)
        label = config_tw_rob_base_sent_lat.id2label[np.argsort(score)[::-1][0]]
        cardiffroberta_sentiment_prediction.append(label)
        cardiffroberta_sentiment_prediction_softmax.append(max(score))
        # positive label
        if label == labels[0]:
            cardiffroberta_sentiment_prediction_num.append(1)
        # negative label
        elif label == labels[2]:
            cardiffroberta_sentiment_prediction_num.append(-1)
        # neutral label
        else:
            cardiffroberta_sentiment_prediction_num.append(0)


    dataset['cardiffroberta_sentiment_prediction'] = cardiffroberta_sentiment_prediction
    dataset['cardiffroberta_sentiment_prediction_softmax'] = cardiffroberta_sentiment_prediction_softmax
    dataset['cardiffroberta_sentiment_prediction_num'] = cardiffroberta_sentiment_prediction_num

    model_name = "cardiffroberta"
    
    # model name and labels will be needed later on as input variables for plotting and mapping
    print("Variables that will later be required for plotting and mapping:")
    return model_name, labels
                                   Precision  Recall  Accuracy    F1
Twitter Roberta Base Sent Latest        0.75    0.75      0.81  0.75
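Macro-averaged scores like these can be computed with scikit-learn; a sketch with hypothetical gold labels and predictions in {-1, 0, 1}:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = [1, 0, -1, 1, -1, 0]   # hypothetical gold labels
y_pred = [1, 0, -1, 0, -1, 1]   # hypothetical model predictions

print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall:   ", recall_score(y_true, y_pred, average="macro"))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred, average="macro"))
```

`average="macro"` gives each of the three classes equal weight, which matters here because the class counts are imbalanced.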

RoBERTa Example use case#

Polarity-Based Sentiment Analysis of Georeferenced Tweets Related to the 2022 Twitter Acquisition (Schmidt et al., 2023)

We performed a simple document-level, polarity-based sentiment analysis (cf. Figure 1) in Python 3.9 to categorise the Tweets as either “positive”, “neutral” or “negative”.

https://www.mdpi.com/2078-2489/14/2/71



BERTweet#

Model: finiteautomata/bertweet-base-sentiment-analysis

Source: https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis?text=I+hate+this

This is a BERTweet-base RoBERTa model trained on SemEval 2017 (~40k Tweets). It uses POS, NEG, NEU labels and is suitable for English and Spanish languages. pysentimiento is an open-source library for non-commercial use and scientific research purposes only. Please be aware that models are trained with third-party datasets and are subject to their respective licenses.

pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks (Perez et al., 2021)

The aim of this research was to perform both Sentiment Analysis and Emotion Analysis on Twitter datasets and identify the best-performing models. For Sentiment Analysis, two datasets were used: TASS 2020 Task 1 and SemEval 2017 Task 4 Subtask 1. Both datasets were labeled with general polarity using positive, negative and neutral outcomes.

A series of models were tested for the given tasks: for English, they tested BERT base, RoBERTa base, BERTweet and multilingual models, namely DistilBERT and mBERT. Fewer models are available for Spanish: they used BETO, a Spanish-trained version of BERT, and the aforementioned multilingual models. The authors use the BERTweet model as the base model for their sentiment analysis task.


More on BERTweet

BERTweet: A pre-trained language model for English Tweets (Nguyen et al., 2020)

BERTweet uses the same architecture as BERT-base, which is trained with a masked language modeling objective (Devlin et al., 2019). BERTweet’s pre-training procedure is based on RoBERTa (Liu et al., 2019), which optimizes the BERT pre-training approach for more robust performance.

The authors use an 80GB pre-training dataset of uncompressed texts, containing 850M Tweets (16B word tokens). Here, each Tweet consists of at least 10 and at most 64 word tokens.

bertweetanalyzer = create_analyzer(task="sentiment", lang="en")
# testing
text = "This is aweful"
text = bertweetpreprocess(text)
result = bertweetanalyzer.predict(text)
print(result.output)
print(np.round(result.probas['NEG'], 4))
NEG
0.9786
                                   Precision  Recall  Accuracy    F1
Twitter Roberta Base Sent Latest        0.75    0.75      0.81  0.75
BERTweet base sentiment analysis        0.75    0.72      0.79  0.73

Fine-Tuned Downstream Sentiment Analysis#

Model: Seethal/sentiment_analysis_generic_dataset

This is a BERT base model (uncased), pretrained on English language using a masked language modeling (MLM) objective. This model is uncased: it does not make a difference between english and English.

This is a fine-tuned downstream version of the bert-base-uncased model for sentiment analysis. It is not intended for further downstream fine-tuning on other tasks; it was trained on a labelled dataset for text classification.

# testing 
text = 'Today is an amazing day.'
text = preprocess(text)
encoded_input = tokenizer_seethal_gen_data(text, return_tensors='pt')
output = model_seethal_gen_data(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
# Print labels and scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    # the Cardiff config's id2label is reused here so scores print with readable names
    l = config_tw_rob_base_sent_lat.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
1) positive 0.9969
2) neutral 0.0027
3) negative 0.0005
# apply in the form of a function so it can be called for usecase later on
def seethal_gen_data_apply(dataset):
    
    # create variable for labels (good to bad)
    labels= ['positive', 'neutral', 'negative']
    
    # lists to be filled
    seethal_gen_data_sentiment_prediction = []
    seethal_gen_data_sentiment_prediction_softmax = []
    seethal_gen_data_sentiment_prediction_num = []
    
    # iterate over dataset
    for index, row in dataset.iterrows():
        text = row['text']
        text = preprocess(text)
        encoded_input = tokenizer_seethal_gen_data(text, return_tensors='pt')
        output = model_seethal_gen_data(**encoded_input)
        score = np.round(softmax(output[0][0].detach().numpy()), 4)
        # the Cardiff config's id2label is reused here for readable label names
        label = config_tw_rob_base_sent_lat.id2label[np.argsort(score)[::-1][0]]
        seethal_gen_data_sentiment_prediction.append(label)
        seethal_gen_data_sentiment_prediction_softmax.append(max(score))
        # positive label
        if label == labels[0]:
            seethal_gen_data_sentiment_prediction_num.append(1)
        # negative label
        elif label == labels[2]:
            seethal_gen_data_sentiment_prediction_num.append(-1)
        # neutral label
        else:
            seethal_gen_data_sentiment_prediction_num.append(0)


    dataset['seethal_gen_data_sentiment_prediction'] = seethal_gen_data_sentiment_prediction
    dataset['seethal_gen_data_sentiment_prediction_softmax'] = seethal_gen_data_sentiment_prediction_softmax
    dataset['seethal_gen_data_sentiment_prediction_num'] = seethal_gen_data_sentiment_prediction_num

    model_name = "seethal_gen_data"
    
    # model name and labels will be needed later on as input variables for plotting and mapping
    print("Variables that will later be required for plotting and mapping:")
    return model_name, labels
                                        Precision  Recall  Accuracy    F1
Twitter Roberta Base Sent Latest             0.75    0.75      0.81  0.75
BERTweet base sentiment analysis             0.75    0.72      0.79  0.73
Seethal Senti Analysis Generic Dataset       0.69    0.60      0.73  0.64

Challenges with Transformers#

  • Language

  • Data availability

  • Working with long documents (more than paragraphs)

  • Opacity (black box)

  • Bias (trained on data from the internet)


Useful for Disaster Responses?#

By analyzing the sentiment contained in large volumes of social media posts related to a disaster, models like those shown above can provide valuable insights into the public’s perception of the situation, the effectiveness of response efforts, and the needs and concerns of those affected.

Document-level sentiment analysis can be used to monitor the overall sentiment trend over time and identify areas where response efforts may be lacking or where additional resources may be needed. It can also be used to identify specific topics and issues that are of concern to the public, such as the availability of food, water, and medical supplies, or the level of safety in evacuation centers.

  Text Sentiment
47 @ThisFineBobo Give me 3 minutes 😂 NEU
4499 @MDegen55 GOOD MORNING BEATE. I WISH YOU A MAGIC FRIDAY. 🍀🍀 https://t.co/QaeDrHIWHC POS
2777 Just posted a photo @ Jaco's Paddock Motorsport https://t.co/81faPiQLpk NEU
Counts of pos, neu, and neg sentiments: [1674, 753, 130]
_images/Deep_Learning_64_1.png
_images/Deep_Learning_65_0.png _images/Deep_Learning_65_1.png
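Breakdowns like the pos/neu/neg counts above come from a simple `value_counts`; a sketch with a hypothetical prediction column:

```python
import pandas as pd

# hypothetical predicted labels for a handful of tweets
preds = pd.Series(["POS", "NEU", "POS", "NEG", "POS", "NEU"])
counts = preds.value_counts()
print([int(counts.get(k, 0)) for k in ["POS", "NEU", "NEG"]])  # → [3, 2, 1]
```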

Geographic Sentiments#

(Interactive map of georeferenced tweet sentiments; the notebook must be trusted to render it.)

Sentiments Over Time#

Tweet Sentiments over Time (Day Intervals)
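Day-interval aggregation can be done by grouping on the date part of the timestamps; a sketch with hypothetical timestamps and numeric sentiment labels:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-07-20 16:31", "2021-07-20 18:00", "2021-07-21 09:00"]),
    "sentiment_num": [1, -1, 0],   # hypothetical numeric labels
})
# mean sentiment per calendar day
daily = df.groupby(df["Date"].dt.date)["sentiment_num"].mean()
print(daily.tolist())  # → [0.0, 0.0]
```

Plotting `daily` gives a sentiment-over-time curve like the one referenced above.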